
Add generation server scripts using HF accelerate and DS-inference #328

Merged
mayank31398 merged 33 commits into main from add-generation-server on Sep 1, 2022

Conversation

mayank31398
Collaborator

Currently, the scripts are working correctly.
However, there is some memory leak.
In huggingface/accelerate#614, @sgugger says that it's not in accelerate, which is probably true.

My hunch is that this is a related bug: apache/mxnet#19159

@mayank31398
Collaborator Author

This is still WIP
@stas00 stas00 left a comment
Member

Thank you for working on the refactoring and the server code, @mayank31398.

  1. Let's create libraries rather than having the direct scripts import from the server scripts, so that both types import from the same library. So probably the following structure:
bloom-accelerate-inference.py        # the shared library code
bloom-accelerate-inference-cli.py    # the original script, importing from bloom-accelerate-inference.py
bloom-accelerate-inference-server.py # the server script, importing from bloom-accelerate-inference.py

Same for the other sets, e.g. bloom-deepspeed-inference.py, bloom-deepspeed-zero-inference.py, ...

This is of course just a suggestion, I'm happy to hear other ideas.
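
For concreteness, a minimal sketch of the split being suggested, using underscore module names (Python cannot import hyphenated file names directly) and illustrative function names rather than the PR's actual API:

# bloom_accelerate_inference.py -- hypothetical shared library; both entry points import from it.
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str = "bigscience/bloom"):
    # Load the tokenizer and model once; the CLI and the server both reuse this.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model


def generate(tokenizer, model, prompts, **generate_kwargs):
    # Shared tokenize -> generate -> detokenize helper.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, **generate_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# bloom_accelerate_inference_cli.py and bloom_accelerate_inference_server.py would then
# both start with:
#     from bloom_accelerate_inference import load_model, generate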

scripts/inference/README.md (review thread, resolved)
@mayank31398
Collaborator Author

@stas00, I believe the code is in good shape. You can start reviewing.
Also, I'll start updating the README with instructions tomorrow.

inference/cli.py (review thread, resolved)
@stas00
Member

stas00 commented Aug 17, 2022

Oh wow, you developed a whole package around this. Impressive work, @mayank31398

Let's just discuss the user-facing APIs before implementing, to minimize wasted time.

Originally, these were just scripts to benchmark and demo how the various solutions can be used, but now I'm not so sure how demo-friendly these are with so many separate files.

Some design questions:

  • Why does the CLI import from the server code?
  • Would it really make things simpler to have one CLI script for all the different systems? They are quite different after all, with very different arguments - e.g. DS-ZeRO will need an NVMe path as well as flags for CPU/NVMe offload. I just had those hardcoded - but there are 3 ways to invoke this one: gpu / cpu-offload / nvme-offload.
  • The while loop in cli.py isn't debug friendly - can we activate it only with an explicit flag and by default just use pre-made inputs - or perhaps 2 separate scripts? I'm not even sure who would ever want to manually type text - perhaps a flag to feed it a file with prompts instead? Not sure.
  • The multiple-input (BS > 1) functionality got lost in the transition.
  • The naming of the scripts lost their mnemonics - cli.py is very ambiguous, since it tells me nothing about what it does. Please see my suggestions at Add generation server scripts using HF accelerate and DS-inference #328 (review), but I suppose that the way it's going it's no longer BLOOM-specific either, and could probably work with any other model.

I'm also no longer sure what to do with this package. Originally the plan was to stash the original demo scripts under transformers' research files, but now it's growing towards an independent package. Not a bad thing, I'm just thinking aloud.

@mayank31398
Collaborator Author

mayank31398 commented Aug 17, 2022

  1. I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I am still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII lets me get away without doing a broadcast operation on the input myself. Also, for code modification (if required down the line), this might be simpler.
  2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking though.
  3. I think adding an --input_file flag to the CLI script might be useful. If not provided, we can launch in interactive mode (see the sketch after this list). I will try to make this debug friendly. Currently, it crashes.
  4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via the command line for now).
  5. I was having trouble with the naming convention (because of the hyphens in the names, I couldn't import them as packages). We can rename cli to interactive or something.
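
A rough sketch of how the --input_file / interactive switch could look (the flag and function names are hypothetical, not what this PR ships):

import argparse


def get_prompts():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file", type=str, default=None,
                        help="file with one prompt per line; omit for interactive mode")
    args = parser.parse_args()

    if args.input_file is not None:
        # Non-interactive path: read all prompts up front (works out of the box, debug friendly).
        with open(args.input_file) as f:
            return [line.strip() for line in f if line.strip()]

    # Interactive fallback: keep reading prompts until the user enters an empty line.
    prompts = []
    while True:
        text = input("prompt (empty line to stop) > ")
        if not text:
            break
        prompts.append(text)
    return prompts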

Let me know your thoughts on this. Thanks for the review :)

@mayank31398
Collaborator Author

Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM.
I can open up another PR to support other models.
Adding so much functionality in a single PR might not be the best thing to do.

@stas00
Member

stas00 commented Aug 17, 2022

  1. I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I am still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII lets me get away without doing a broadcast operation on the input myself. Also, for code modification (if required down the line), this might be simpler.

So you are not benchmarking the same thing, then.

You don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.

The intention of the CLI was what it did in the original code - tokenize + generate + detokenize - done.
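
i.e. roughly this, and nothing more (a simplified sketch, not the original script verbatim; the prompt and generation settings are just for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", torch_dtype="auto")

inputs = tokenizer(["DeepSpeed is a machine learning framework"], return_tensors="pt").to("cuda:0")  # tokenize
outputs = model.generate(**inputs, max_new_tokens=100)                                               # generate
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))                                     # detokenize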

Perhaps we have miscommunicated here.

2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking though.

I answered earlier in the code comment.

3. I think adding an --input_file flag to the CLI script might be useful. If not provided, we can launch in interactive mode. I will try to make this debug friendly. Currently, it crashes.

I meant the script should work out of the box and require no user input. I think we have a conflict of use cases here: I wanted an easy-to-understand-and-replicate demo, you want a standalone system.

Perhaps the right solution is to just keep the original scripts as they are for the demo/benchmarking purpose, and then have your package for production use. What do you think? There will be a bit of duplication, but as you're saying, you don't care for solutions that are not fast enough, which is fair for your needs.

What crashes?

4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via the command line for now).

5. I was having trouble with the naming convention (because of the hyphens in the names, I couldn't import them as packages). We can rename cli to interactive or something.

For end-user scripts there should be no need to import them.

Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM.
I can open up another PR to support other models.
Adding so much functionality in a single PR might not be the best thing to do.

Agreed, let's stick to BLOOM for now. And that's why I suggested that the naming of the scripts should indicate that they are BLOOM-specific, which your version doesn't.

@mayank31398
Collaborator Author

mayank31398 commented Aug 17, 2022

You don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.

The CLI works this way: the user only needs to provide the input text. Also, I need to call the generate method on all 8 GPU processes (in non-server usage), and since the input text is only requested on 1 process, you need to broadcast it, as far as I understand.
Yes, the CLI for DS-inference is built as a client to the server. I can change it though, if we want.
In the original DS-inference script, @stas00, you didn't need to broadcast since each process was creating the list.
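
For reference, the broadcast I mean is essentially this (a sketch; it assumes torch.distributed is already initialized across the 8 processes):

import torch.distributed as dist


def get_input_text():
    # Only rank 0 actually prompts the user; the string is then broadcast to every rank
    # so that all processes can take part in the same generate() call.
    obj = [input("Input text: ")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]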

I will incorporate your suggestions.

@mayank31398
Collaborator Author

So, just to summarize for cli.py:
You want the option for the user to provide an input text file?

@stas00
Member

stas00 commented Aug 24, 2022

I feel like an empty prompt is fine. This is basically unconditional generation, right?

Good question - I have never tried it - I guess it should work.

@mayank31398
Collaborator Author

Also, @stas00, I have been meaning to ask:
is accelerate using DeepSpeed ZeRO as its backend?
Because if that is the case, then why does the generation time per batch differ so greatly between the two?

@stas00
Member

stas00 commented Aug 24, 2022

Is accelerate using DeepSpeed ZeRO as its backend?

It can, but it's not using it in our scripts here. It uses naive pipeline parallelism, assigning different layers to different devices and switching between them.
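
Concretely, the layer-to-device assignment comes from device_map="auto" and can be inspected via hf_device_map (a sketch; the printed mapping is just an example):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", torch_dtype="auto")
print(model.hf_device_map)
# e.g. {'transformer.word_embeddings': 0, 'transformer.h.0': 0, ..., 'transformer.h.69': 7, 'lm_head': 7}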

@mayank31398
Collaborator Author

@stas00, while working on this, I found a really interesting bug. Could be a severe one.
I have provided a minimal working example.

@mayank31398
Collaborator Author

from huggingface_hub import snapshot_download

  1. I am still using the old method (it still works for >=0.9.0) if the user inputs the model_name as bigscience/bloom in ds-inference. I am doing this because snapshot_download is not respecting TRANSFORMERS_CACHE and I don't want it to download duplicate checkpoints (for bigscience/bloom). However, this is slow (it re-shards the 72 files).
  2. For the newer weights provided by Microsoft, I am using snapshot_download. I am also adding support for int8 in this PR. This path is much faster.

I feel like this is the best approach for now. If we want to change this we can do that in a separate PR in the future.

@stas00

@stas00
Member

stas00 commented Aug 28, 2022

because snapshot_download is not respecting the TRANSFORMERS_CACHE

If you use HUGGINGFACE_HUB_CACHE instead of TRANSFORMERS_CACHE it would work for either.

Alternatively you can do:

HUGGINGFACE_HUB_CACHE = os.getenv("HUGGINGFACE_HUB_CACHE", None)
cache_dir = os.getenv("TRANSFORMERS_CACHE", HUGGINGFACE_HUB_CACHE)
snapshot_download(..., cache_dir=cache_dir)

If neither is set, None would lead to the default path.

You can see:

https://github.com/huggingface/transformers/blob/21f6f58721dd9154357576be6de54eefef1f1818/src/transformers/utils/hub.py#L92

@stas00
Member

stas00 commented Aug 28, 2022

Actually, the first line isn't needed, as it's already the default:
https://github.com/huggingface/huggingface_hub/blob/7aadc3243d9478a48e17f568d8d4950cafd11ca2/src/huggingface_hub/_snapshot_download.py#L103-L104

so just:

cache_dir = os.getenv("TRANSFORMERS_CACHE", None)
snapshot_download(..., cache_dir=cache_dir)

@mayank31398
Collaborator Author

DeepSpeed-MII not working with new Microsoft weights: microsoft/DeepSpeed-MII#50

@mayank31398
Collaborator Author

mayank31398 commented Aug 31, 2022

@stas00 https://huggingface.co/bigscience/bloom/tree/main
Some changes were pushed to bigscience/bloom just 3 hours ago.
snapshot_download fails as a result of this change.

  File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_zero/model.py", line 17, in __init__
    return snapshot_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
    downloaded_model_path = get_downloaded_model_path(args.model_name)
  File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/utils/model.py", line 105, in get_downloaded_model_path
    return f(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
    return snapshot_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
    _ = hf_hub_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
    return f(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
    _ = hf_hub_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
    _create_relative_symlink(blob_path, pointer_path)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
    _create_relative_symlink(blob_path, pointer_path)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
    os.symlink(relative_src, dst)
FileExistsError: [Errno 17] File exists: '../../../../blobs/c4072f2f02baee6f0d50739cf14439f61e0605913ed65606f9bcb11043838a2e' -> '/home/gpttest/nfs/mayank/HF_cache/models--bigscience--bloom/snapshots/874fa447d586d0a28ab66aa3ca49da43ae1ec0f6/tensorboard/main/events.out.tfevents.1649441015.jean-zay-iam50.156467.0'

Any idea how to get around this?
Also, if this is fixed, would snapshot_download download all the checkpoints again?
That would be highly impractical, right?

Please let me know. I am not sure of a good solution.
Also, clearing the cache for every commit might not be practical.

@stas00
Member

stas00 commented Aug 31, 2022

It looks like only README.md was updated, so it's weird that it would fail.

I just ran:

python -c "from huggingface_hub import snapshot_download; snapshot_download('bigscience/bloom')"

and it updated w/o any issues.

Probably move the file it complains about out of the way for now? Could be a bug in huggingface_hub?

It shouldn't re-download anything but the new or updated files.

@stas00
Member

stas00 commented Aug 31, 2022

@mayank31398, please also note the renames in this commit https://github.com/microsoft/DeepSpeed/pull/2217/files/c77f5e0e385db405475b2842a6f5725c0f551170..f3f4b1dda9f9ba09042b8b63bbee475c9364bc56 in particular s/mp_size/tp_size/

This is now merged, so we should update our side - you can just do a global search and replace on all of the files under scripts - or if you'd prefer that I do it, please let me know - I don't want to step on your toes if you're still on the old DS branch.

@mayank31398
Collaborator Author

mayank31398 commented Sep 1, 2022

Let me try updating to the latest huggingface_hub and DeepSpeed to see if that fixes things.
Never mind, I am on the latest versions.
I agree this should work, @stas00.
It's working fine for GPT-2.
No idea what is going wrong for BLOOM though. For now, I'll just remove this file.

@mayank31398
Collaborator Author

Never mind, I have fixed this. It was a bug because all 8 processes in my DeepSpeed code were trying to download simultaneously.
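
For reference, the usual pattern for avoiding this kind of race (not necessarily the exact fix in this PR; it assumes torch.distributed is already initialized):

import torch.distributed as dist
from huggingface_hub import snapshot_download


def download_once(model_name: str) -> str:
    # Let rank 0 populate the cache first; the other ranks then hit the cache instead of the network.
    if dist.get_rank() == 0:
        path = snapshot_download(model_name)
    dist.barrier()
    if dist.get_rank() != 0:
        path = snapshot_download(model_name)
    return path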

@stas00
Member

stas00 commented Sep 1, 2022

Oddly enough, it doesn't look like mp_size was renamed to tp_size, so I only had to modify this line for the updated checkpoints:

checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")

@stas00
Member

stas00 commented Sep 1, 2022

@mayank31398, would it be OK if we merged this PR - and then perhaps have another iteration of tweaks in subsequent PRs if needed? As we are renaming files I want the rename to be merged first before tweaking those.

You have the commit rights, so you can do the honours if you're in agreement.

Thank you!

@stas00 stas00 left a comment
Member

Let's merge this

@mayank31398
Collaborator Author

Yes @stas00, currently MII is not working with DS-inference. But we can merge and I can continue to work on this.

@mayank31398 mayank31398 merged commit f9402d0 into main Sep 1, 2022
@mayank31398 mayank31398 deleted the add-generation-server branch September 1, 2022 17:06
@mayank31398
Collaborator Author

Also, @stas00, there are some pieces of code that both the server folder and the scripts folder can share.
I will work on factoring the redundant code out once everything in the server is fixed :)

@stas00
Member

stas00 commented Sep 1, 2022

Perfect, then you can start a new PR once you have some new edits. We can just do it in iterations.

I need to edit the original scripts to support the new repos. I'll do that in another PR.

younesbelkada pushed a commit to younesbelkada/Megatron-DeepSpeed that referenced this pull request Sep 28, 2022
…igscience-workshop#328)

* first step towards making libs

* HF accelerate model

* refactor accelerate

* refactor DS inference

* refactor DS ZeRO

* make inference library

* cli

* server

* request

* remove MaxTokensError

* fix batch size error with DS inference server

* type fix

* add latency

* add latency

* add min_length to default kwargs

* str kwargs

* str kwargs

* fix comma

* add old scripts back

* move scripts

* drop data

* minor changes + add README

* update README

* drop nccl

* fix

* default values

* resolve issues

* handle keyboard interrupt

* remove caching

* use snapshot_download

* make server class

* fix snapshot download

Co-authored-by: Mayank Mishra <mayank31398@gmail.com>